On wavelet-based features for classifying mass spectrometry data

CSDA Lab, Mathematics and Statistics Department, University of West Florida

Michael Carnival, MS and Achraf Cohen, PhD

Outline

  • Introduction
  • Related Work
  • Methods
  • Results
  • Conclusions

Introduction

High-dimensional data refers to data with a large number of features (covariates) \(p\). Formally, the data matrix is \(\mathbf{X} \in \mathbb{R}^{n\times p}\) with

\[ p \gg n, \tag{1} \] where \(n\) is the number of observations.
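As a concrete illustration of the \(p \gg n\) regime (shapes chosen to match the mass-spectrum setting described later; the data here are synthetic):

```python
import numpy as np

# Hypothetical p >> n data matrix: 80 subjects, 90,000 features,
# mimicking the dimensions of the mass-spectrum problem.
rng = np.random.default_rng(0)
n, p = 80, 90_000
X = rng.normal(size=(n, p))   # rows = observations, columns = features

print(X.shape)  # (80, 90000)
```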

Introduction

In this context, many challenges arise:

  • Sparse Data
  • Overfitting
  • Feature Redundancy and Correlation
  • Increased Training and Modeling time
  • Difficulty in Visualization
  • Class Imbalance
  • Noisy Features

Introduction

Some solutions in the literature:

  • Regularization: ridge regression (Hoerl and Kennard 1970) and the lasso (Tibshirani 1996)
  • Dimensionality reduction: PCA (Jolliffe 2002), t-SNE (Maaten and Hinton 2008), UMAP (McInnes, Healy, and Melville 2020)
  • Feature selection, e.g., via conditional mutual information (Fleuret 2004)

The problem in hand

Our data are mass spectrometry signals, i.e., functional data.

  • \(p > 90{,}000\) features (dimensions) and \(n < 100\) subjects (observations)

  • Our goal is to develop a methodology to classify such functional data, e.g., cancer mass spectra (cancer is the second leading cause of death worldwide, after heart disease)
  • Our focus is to leverage the capabilities of Wavelet Analysis.

Wavelet Analysis

The Fourier Transform of a signal \(x(t)\) can be expressed as:

\[ X(f)= \int_{-\infty}^{\infty} x(t) e^{-i2 \pi ft} dt \tag{2} \] (recall Euler’s formula, \(e^{ix}= \cos x + i \sin x\)); \(f\) denotes frequency.

The Wavelet Transform of a signal \(x(t)\) can be given as:

\[ WT(s,\tau)= \frac{1}{\sqrt s}\int_{-\infty}^{\infty} x(t) \psi^*\big(\frac{t-\tau}{s}\big) dt, \tag{3} \]

where \(\psi^*(t)\) denotes the complex conjugate of the base wavelet \(\psi(t)\); \(s\) is the scaling parameter and \(\tau\) is the location (translation) parameter.

Example: Morlet wavelet \(\psi(t) = e^{i2 \pi f_0t} e^{-(\alpha t^2/\beta^2)}\), where \(f_0\), \(\alpha\), and \(\beta\) are constants.
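Eq. (3) with this Morlet wavelet can be evaluated numerically by a Riemann sum; the sketch below (NumPy only; function names, toy signal, and parameter values are our own choices) computes a single \(WT(s,\tau)\) value:

```python
import numpy as np

def morlet(t, f0=1.0, alpha=1.0, beta=1.0):
    """Morlet wavelet psi(t) = exp(i*2*pi*f0*t) * exp(-alpha*t^2/beta^2)."""
    return np.exp(1j * 2 * np.pi * f0 * t) * np.exp(-alpha * t**2 / beta**2)

def cwt_point(x, t, s, tau, f0=1.0, alpha=1.0, beta=1.0):
    """Approximate WT(s, tau) of Eq. (3) by a Riemann sum on the sampled grid t."""
    dt = t[1] - t[0]
    psi = morlet((t - tau) / s, f0, alpha, beta)       # scaled, shifted wavelet
    return (1.0 / np.sqrt(s)) * np.sum(x * np.conj(psi)) * dt

# Toy signal: a 5 Hz sine sampled on [0, 2) seconds
t = np.arange(0.0, 2.0, 1e-3)
x = np.sin(2 * np.pi * 5 * t)

# Scale s = 0.2 tunes the unit-frequency Morlet to ~5 Hz, centered at tau = 1 s
coef = cwt_point(x, t, s=0.2, tau=1.0)
print(abs(coef))
```

The magnitude of `coef` is large when the scaled wavelet's oscillation rate matches the local frequency content of the signal near \(\tau\).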

Wavelet Families

Discrete Wavelet Transform
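The DWT replaces the continuous integral with low-/high-pass filter-bank steps. A single Haar level, implemented from scratch in NumPy to make the split and the perfect reconstruction explicit (a teaching sketch, not the transform used in the pipeline):

```python
import numpy as np

def haar_dwt(x):
    """One level of the Haar DWT: approximation (a) and detail (d) coefficients."""
    x = np.asarray(x, dtype=float)
    a = (x[0::2] + x[1::2]) / np.sqrt(2)   # low-pass: pairwise averages
    d = (x[0::2] - x[1::2]) / np.sqrt(2)   # high-pass: pairwise differences
    return a, d

def haar_idwt(a, d):
    """Invert one Haar level (perfect reconstruction)."""
    x = np.empty(2 * len(a))
    x[0::2] = (a + d) / np.sqrt(2)
    x[1::2] = (a - d) / np.sqrt(2)
    return x

x = np.array([4.0, 2.0, 5.0, 7.0])
a, d = haar_dwt(x)
assert np.allclose(haar_idwt(a, d), x)   # reconstruction check
```

Applying `haar_dwt` recursively to the approximation coefficients yields the usual multilevel decomposition; in practice a library such as PyWavelets provides this for all the standard wavelet families.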

Methods

A typical ML workflow is the following:

  1. Data Collection

  2. Data Processing: Clean, Explore, Prepare, Transform

  3. Modeling: Develop, Train, Validate, and Evaluate

  4. Deployment: Deploy, Monitor and Update

  5. Go to 1.

We designed a statistical experiment to evaluate four different processing approaches.

Methods

Variables of the experimental design:

  • Four pre-processing techniques: two wavelet-based and two not.

    • 5 window sizes.

    • 10 wavelet families (wavelet-based techniques only).

  • Four ML models: Logistic Regression, Support Vector Machine, Random Forest, and XGBoost.

  • Two sampling schemes: up- and down-sampling to address class imbalance.

  • 100 repetitions of each case.

A total of 88,000 models were run.
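The total of 88,000 model runs is consistent with this design if the 10 wavelet families apply only to the two wavelet-based approaches; a quick enumeration (labels are placeholders):

```python
from itertools import product

window_sizes = range(5)                      # 5 window sizes
models = ["LR", "SVM", "RF", "XGBoost"]      # 4 ML models
sampling = ["up", "down"]                    # 2 sampling schemes
reps = 100                                   # repetitions per case

cases = 0
for proc in ["PROC1", "PROC2", "PROC3", "PROC4"]:
    # 10 wavelet families only for the wavelet-based approaches
    wavelets = range(10) if proc in ("PROC1", "PROC2") else [None]
    cases += len(list(product(wavelets, window_sizes, models, sampling))) * reps

print(cases)  # 88000
```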

Methods

  • Processing 1 (PROC1): The feature space includes the mean, variance, energy, coefficient of variation, skewness, and kurtosis, computed after the wavelet transform.

  • Processing 2 (PROC2): Same as PROC1, but the feature space also includes the first 10 autocorrelation coefficients.

  • Processing 3 (PROC3): Same as PROC1 but without the wavelet transform.

  • Processing 4 (PROC4): Same as PROC2 but without the wavelet transform.
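The per-window summary features listed above can be sketched with NumPy alone (the function name and exact estimators, e.g. the biased autocorrelation, are our assumptions, not the study's implementation):

```python
import numpy as np

def window_features(x, n_acf=10):
    """Summary features per window: mean, variance, energy, coefficient of
    variation, skewness, kurtosis, plus the first n_acf autocorrelation
    coefficients (the extra features of PROC2/PROC4)."""
    x = np.asarray(x, dtype=float)
    m, v = x.mean(), x.var()
    s = np.sqrt(v)
    z = (x - m) / s                          # standardized signal
    base = [
        m,                                   # mean
        v,                                   # variance
        np.sum(x**2),                        # energy
        s / m,                               # coefficient of variation
        np.mean(z**3),                       # skewness
        np.mean(z**4) - 3.0,                 # excess kurtosis
    ]
    # Biased autocorrelation estimates at lags 1..n_acf
    acf = [np.mean(z[:-k] * z[k:]) for k in range(1, n_acf + 1)]
    return np.array(base + acf)

feats = window_features(np.sin(np.linspace(0, 10, 200)))
print(feats.shape)  # (16,)
```

For PROC1/PROC2 these features would be computed on the wavelet coefficients; for PROC3/PROC4, on the raw windowed signal.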

Methods

The performance metrics utilized were:

  • Recall
  • Precision
  • F1-score
  • Accuracy
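For a binary cancer/normal problem, these four metrics follow directly from the confusion counts; a minimal sketch (toy numbers, not the study's results):

```python
def metrics(tp, fp, fn, tn):
    """Precision, recall, F1-score, accuracy from binary confusion counts."""
    precision = tp / (tp + fp)                          # of predicted positives, how many correct
    recall = tp / (tp + fn)                             # of actual positives, how many found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, f1, accuracy

# Toy example: 40 true cancer hits, 10 false alarms, 5 misses, 25 true normals
prec, rec, f1, acc = metrics(tp=40, fp=10, fn=5, tn=25)
print(prec, rec, f1, acc)
```

Note that with imbalanced classes, accuracy alone can be misleading, which is why all four metrics are reported.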

Data sets

  • Low-mass range SELDI spectra
    • 50 cancer
    • 30 normal

Observed m/z values: 32,768 / 33,885.

Link: https://bioinformatics.mdanderson.org/public-datasets/

Results

Performance across Processing

| Metric | PROC1 Down (N = 20,000¹) | PROC1 Up (N = 20,000¹) | PROC2 Down (N = 20,000¹) | PROC2 Up (N = 20,000¹) | PROC3 Down (N = 2,000¹) | PROC3 Up (N = 2,000¹) | PROC4 Down (N = 2,000¹) | PROC4 Up (N = 2,000¹) |
|---|---|---|---|---|---|---|---|---|
| Precision | 0.70 ± 0.10 (0.40, 1.00) | 0.70 ± 0.10 (0.40, 1.00) | 0.71 ± 0.12 (0.40, 1.00) | 0.71 ± 0.12 (0.40, 1.00) | 0.78 ± 0.12 (0.57, 1.00) | 0.78 ± 0.12 (0.50, 1.00) | 0.73 ± 0.10 (0.50, 1.00) | 0.73 ± 0.10 (0.50, 1.00) |
| Recall | 0.88 ± 0.16 (0.40, 1.00) | 0.88 ± 0.16 (0.40, 1.00) | 0.86 ± 0.17 (0.40, 1.00) | 0.86 ± 0.17 (0.40, 1.00) | 0.91 ± 0.12 (0.60, 1.00) | 0.91 ± 0.12 (0.60, 1.00) | 0.88 ± 0.17 (0.40, 1.00) | 0.88 ± 0.17 (0.40, 1.00) |
| F1-score | 0.77 ± 0.10 (0.40, 1.00) | 0.77 ± 0.10 (0.40, 1.00) | 0.77 ± 0.10 (0.40, 1.00) | 0.77 ± 0.11 (0.40, 1.00) | 0.83 ± 0.10 (0.60, 1.00) | 0.83 ± 0.10 (0.55, 1.00) | 0.79 ± 0.10 (0.44, 0.91) | 0.79 ± 0.10 (0.44, 0.91) |
| Accuracy | 0.68 ± 0.13 (0.25, 1.00) | 0.68 ± 0.13 (0.25, 1.00) | 0.68 ± 0.13 (0.25, 1.00) | 0.68 ± 0.13 (0.25, 1.00) | 0.77 ± 0.13 (0.50, 1.00) | 0.77 ± 0.13 (0.38, 1.00) | 0.71 ± 0.12 (0.38, 0.88) | 0.71 ± 0.12 (0.38, 0.88) |

¹ Mean ± SD (Min, Max); Down = down-sampling, Up = up-sampling.

All four processing approaches perform similarly across the ML techniques and sampling schemes.

Results

Performance across window sizes

References

Cohen, Achraf, Chaimaa Messaoudi, and Hassan Badir. 2018. “A New Wavelet-Based Approach for Mass Spectrometry Data Classification.” In New Frontiers of Biostatistics and Bioinformatics, edited by Yichuan Zhao and Ding-Geng Chen, 175–89. Cham: Springer International Publishing. https://doi.org/10.1007/978-3-319-99389-8_8.
Du, Jianqiang, Xiao-Min Wu, Bo Wang, Heng-Jie Su, Kai Ma, and Hu-Qin Zhang. 2009. “Wavelet Transform and Bagging Predictor Approaches to Cancer Identification from Mass Spectrometry-Based Proteomic Data.” In 2009 3rd International Conference on Bioinformatics and Biomedical Engineering, 1–4. IEEE.
Fleuret, François. 2004. “Fast Binary Feature Selection with Conditional Mutual Information.” Journal of Machine Learning Research 5: 1237–63.
Hoerl, Arthur E, and Robert W Kennard. 1970. “Ridge Regression: Biased Estimation for Nonorthogonal Problems.” Technometrics 12 (1): 55–67.
Jolliffe, I. T. 2002. Principal Component Analysis. 2nd ed. Springer.
Maaten, Laurens van der, and Geoffrey Hinton. 2008. “Visualizing High-Dimensional Data Using t-SNE.” Journal of Machine Learning Research 9: 2579–2605.
McInnes, Leland, John Healy, and James Melville. 2020. “UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction.” https://arxiv.org/abs/1802.03426.
Nguyen, Thanh, Saeid Nahavandi, Douglas Creighton, and Abbas Khosravi. 2015. “Mass Spectrometry Cancer Data Classification Using Wavelets and Genetic Algorithm.” FEBS Letters 589 (24): 3879–86.
Schleif, Frank-Michael, Mathias Lindemann, Mario Diaz, Peter Maaß, Jens Decker, Thomas Elssner, Michael Kuhn, and Herbert Thiele. 2009. “Support Vector Classification of Proteomic Profile Spectra Based on Feature Extraction with the Bi-Orthogonal Discrete Wavelet Transform.” Computing and Visualization in Science 12: 189–99.
Tibshirani, Robert. 1996. “Regression Shrinkage and Selection via the Lasso.” Journal of the Royal Statistical Society: Series B (Methodological) 58 (1): 267–88.
Vimalajeewa, Dixon, Scott Alan Bruce, and Brani Vidakovic. 2023. “Early Detection of Ovarian Cancer by Wavelet Analysis of Protein Mass Spectra.” Statistics in Medicine 42 (13): 2257–73.
Yu, J. S., S. Ongarello, R. Fiedler, X. W. Chen, G. Toffolo, C. Cobelli, and Z. Trajanoski. 2005. “Ovarian Cancer Identification Based on Dimensionality Reduction for High-Throughput Mass Spectrometry Data.” Bioinformatics 21 (10): 2200–2209. https://doi.org/10.1093/bioinformatics/bti370.